25 research outputs found

    Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol

    Full text link
    Distributed Machine Learning (DML) systems are used to speed up model training in data centers (DCs) and edge nodes. The Parameter Server (PS) communication architecture is commonly employed, but it suffers from severe long-tail latency caused by many-to-one "incast" traffic patterns, which degrades training throughput. To address this challenge, we design the Loss-tolerant Transmission Protocol (LTP), which permits partial loss of gradients during synchronization to avoid unneeded retransmissions and thereby shortens synchronization in each iteration. LTP implements loss-tolerant transmission through out-of-order transmission and out-of-order acknowledgments (ACKs). LTP employs Early Close to adjust the loss-tolerant threshold according to network conditions and Bubble Filling to correct the data so that training accuracy is maintained. LTP is implemented in C++ and integrated into PyTorch. Evaluations on a testbed with eight worker nodes and one PS node demonstrate that LTP can improve the throughput of DML training tasks by up to 30x compared to traditional TCP congestion controls, with no sacrifice in final accuracy.
    Comment: This paper will be published at IWQoS 2023. Preview version only.
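    The loss-tolerant idea can be sketched in a few lines. The following is an illustrative stand-in for the PS-side logic, not the paper's C++ implementation: gradient chunks arrive (and are ACKed) out of order, the iteration closes early once a tolerable fraction has arrived, and missing chunks are zero-filled ("Bubble Filling"). All names and the threshold value are assumptions.

```python
def aggregate_with_loss_tolerance(arrived_chunks, total_chunks, chunk_size,
                                  threshold=0.9):
    """arrived_chunks: {chunk_id: list of floats}, possibly incomplete.

    Returns the assembled gradient once at least `threshold` of the chunks
    have arrived (Early Close), or None to keep waiting. Chunks lost in
    transit are replaced by zeros (Bubble Filling) so the tensor shape
    stays consistent for the optimizer step.
    """
    if len(arrived_chunks) / total_chunks < threshold:
        return None  # not enough gradient data yet
    gradient = []
    for cid in range(total_chunks):
        gradient.extend(arrived_chunks.get(cid, [0.0] * chunk_size))
    return gradient
```

    Because a lost chunk only perturbs one iteration's gradient slightly, skipping its retransmission trades a little per-step noise for much shorter synchronization under incast loss.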

    OSP: Boosting Distributed Model Training with 2-stage Synchronization

    Full text link
    Distributed deep learning (DDL) is a promising research area that aims to increase the efficiency of training deep learning tasks with large datasets and models. As the computation capability of DDL nodes continues to grow, the network connection between nodes is becoming a major bottleneck. Various gradient-compression and improved model-synchronization methods have been proposed to address this bottleneck in Parameter-Server-based DDL. However, the former can cause accuracy loss due to discarded gradients, and the latter offers only limited improvement in model-synchronization throughput. To address these challenges, we propose a new model synchronization method named Overlapped Synchronization Parallel (OSP), which achieves efficient communication with a 2-stage synchronization approach and uses Local-Gradient-based Parameter correction (LGP) to avoid the accuracy loss caused by stale parameters. A prototype of OSP has been implemented in PyTorch and evaluated on commonly used deep learning models and datasets with a 9-node testbed. Evaluation results show that OSP achieves up to 50% higher throughput without accuracy loss compared to popular synchronization models.
    Comment: Copyright Owner/Author | ACM 2023. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in ICPP 2023.
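    A much-simplified sketch of the LGP idea: parameters already refreshed by the first synchronization stage are used directly, while parameters that are still stale when computation resumes are corrected with the worker's own local gradient. The function name, data layout, and learning rate are illustrative assumptions, not the paper's PyTorch implementation.

```python
def osp_step(params, fresh_global, local_grads, lr=0.1):
    """params: this worker's current parameter copy (list of floats).
    fresh_global: {index: value} for parameters synced in stage 1.
    local_grads: this worker's most recent local gradients.
    """
    updated = []
    for i, p in enumerate(params):
        if i in fresh_global:
            updated.append(fresh_global[i])      # stage-1 synchronized value
        else:
            updated.append(p - lr * local_grads[i])  # LGP: correct stale value
    return updated
```

    The point of the correction is that a stale parameter nudged by the local gradient is a better starting point for the overlapped compute phase than the stale value alone.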

    Influence of spike defect on the impulse breakdown characteristics of SF6

    No full text

    Study on heat fluxes and their effect on electrode material removal on the copper electrode of a field-distortion gas spark switch

    No full text
    Gas switches are key elements in pulsed-power devices, and electrode erosion is a key factor restricting the development and application of high-power gas switches. Based on the thermal equilibrium equation near the electrode surface, this paper calculates the electrode heat fluxes and their peak powers under different discharge conditions, and analyses their effect on the mode of electrode-material removal. The calculation results indicate that when the discharge current is not too high, arc Joule heating is the main cause of electrode erosion and solid material removal may occur; when the discharge current is high enough, vaporization of the electrode material is the main cause of electrode erosion.
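    The surface thermal balance invoked above can be sketched generically; the flux terms and symbols below are illustrative assumptions, not the paper's notation:

```latex
% Net heat flux delivered to the electrode surface during discharge:
% arc Joule heating and charged-particle bombardment in,
% vaporization and radiation losses out.
q_{\mathrm{net}}(t) = q_{\mathrm{Joule}}(t) + q_{\mathrm{ion}}(t)
                    - q_{\mathrm{vap}}(t) - q_{\mathrm{rad}}(t)

% One-dimensional heat conduction into the electrode bulk, with the net
% flux imposed as the boundary condition at the surface x = 0:
\rho c_p \frac{\partial T}{\partial t} = k \frac{\partial^2 T}{\partial x^2},
\qquad
-k \left.\frac{\partial T}{\partial x}\right|_{x=0} = q_{\mathrm{net}}(t)
```

    Whether material leaves as ejected melt (solid/liquid removal) or as vapor then depends on whether the surface temperature obtained from this balance stays below or exceeds the boiling point of the electrode material.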

    Robustness testing for software components

    Get PDF
    Component-based development allows one to build software from existing components and promises to improve software reuse and reduce costs. For critical applications, the user of a component must ensure that it fits the requirements of the application. When the source code of a component is not available, testing is a well-suited means to achieve this. Robustness testing is a testing methodology that detects the vulnerabilities of a component under unexpected inputs or in a stressful environment. As components may fail differently in different states, we use a state-machine-based approach to robustness testing. First, a set of paths is generated to cover the transitions of the state machine, and the test cases use these paths to bring the component into a specific control state. Second, method calls with invalid inputs are fed to the component in different states to test its robustness. By traversing the paths, the test cases cover more states and transitions than stateless API testing. We apply our approach to several components, including open source software, and compare our results with existing approaches.
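    The first step above, generating transition-covering paths, can be sketched as follows. This is an illustrative path generator, not the paper's algorithm: it finds a shortest event sequence to each transition's source state by breadth-first search and appends the transition's event, giving one driving sequence per transition.

```python
from collections import deque

def transition_covering_paths(transitions, start):
    """transitions: {(state, event): next_state}.

    Returns one event sequence per reachable transition: a shortest
    prefix that drives the component into the transition's source state,
    followed by the transition's own event.
    """
    # BFS for a shortest event sequence from `start` to every state.
    prefix = {start: []}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for (src, ev), dst in transitions.items():
            if src == s and dst not in prefix:
                prefix[dst] = prefix[s] + [ev]
                queue.append(dst)
    # One covering path per transition whose source state is reachable.
    return [prefix[src] + [ev]
            for (src, ev), _dst in transitions.items() if src in prefix]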

    Failure times prediction of field-distortion gas switch based on electrode surface roughness

    No full text
    Predicting the number of discharges before failure has a great influence on determining the repair cycle of gas switches and pulsed-power systems, preventing accidents, and reducing cost. In this paper, electrode surface roughness (ESR) is proposed as the basis for analyzing switch performance and predicting switch failure times. According to the one-dimensional heat conduction equation and the thermal equilibrium equation near the electrode surface, the etch-pit depth can be calculated for different discharge conditions. The electrode surface roughness is then obtained by calculating the deepest etch-pit depth and the burr peak height in the electrode erosion region for the same discharge condition. The switch failure times can be predicted from the trend of the ESR with the number of discharges. Experimental results indicate that this calculation model predicts switch failure times effectively.
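    The prediction loop can be illustrated schematically. The linear per-shot roughening model and the failure threshold below are assumptions for illustration only; the paper derives the per-shot pit depth and burr height from its thermal model rather than taking them as constants.

```python
def predict_failure_times(pit_depth_per_shot, burr_height_per_shot,
                          roughness_limit, max_shots=1_000_000):
    """Accumulate an ESR estimate shot by shot from the deepest etch-pit
    depth and burr peak height, and report the discharge count at which
    the roughness limit is first crossed (the predicted failure time)."""
    esr = 0.0
    for n in range(1, max_shots + 1):
        esr += pit_depth_per_shot + burr_height_per_shot
        if esr >= roughness_limit:
            return n
    return None  # no failure predicted within max_shots
```

    In practice the per-shot contributions would themselves vary with discharge current and gas condition, so the trend of ESR versus discharge count, not a constant rate, carries the prediction.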

    Machine Learning Aided Prediction of Glass-Forming Ability of Metallic Glass

    No full text
    Predicting the glass-forming ability (GFA) of metallic glasses (MGs) can accelerate their development. In this paper, a dataset was constructed from experimental data collected from the literature and books, and a machine learning-based predictive model was established to predict the GFA. First, a classification model based on the size of the critical diameter (Dmax) was established to determine whether an alloy system can form a glassy state, reaching an accuracy of 0.98. Then, regression models were established to predict the crystallization temperature (Tx), glass transition temperature (Tg), and liquidus temperature (Tl) of MGs. The R2 of each regression model on the test set was greater than 0.89, showing good prediction accuracy. The key features used by the regression models were selected using variance, correlation, embedded, recursive, and exhaustive methods. Furthermore, to improve the interpretability of the prediction models, feature importance, partial dependence plots (PDP), and individual conditional expectation (ICE) methods were used for visualization, demonstrating how the features affect the target variables. Finally, taking Zr-Cu-Ni-Al MGs as an example, a genetic algorithm was applied to the prediction model to optimize the alloy composition for high GFA within the compositional space, achieving optimal design of the alloy composition.
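    The three regression targets (Tx, Tg, Tl) are the ingredients of standard GFA indicators from the metallic-glass literature, the reduced glass-transition temperature Trg = Tg/Tl and the gamma parameter γ = Tx/(Tg + Tl); whether this paper uses these particular indicators is not stated, so the helper below is illustrative:

```python
def gfa_indicators(tx, tg, tl):
    """Compute two standard GFA criteria from characteristic temperatures
    (all in the same units, e.g. kelvin): higher values of both indicate
    better glass-forming ability."""
    return {
        "Trg": tg / tl,          # reduced glass-transition temperature
        "gamma": tx / (tg + tl), # gamma parameter
    }
```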

    Prediction of the Fatigue Strength of Steel Based on Interpretable Machine Learning

    No full text
    Most failures in steel materials are due to fatigue damage, so it is of great significance to analyze the key features of fatigue strength (FS) in order to improve fatigue performance. This study collected data on the fatigue strength of steels and established a machine learning (ML) model to predict FS. Three feature-construction strategies were proposed for the dataset and compared on four typical ML algorithms. The combination of Strategy III (composition, heat-treatment, and atomic features) and the GBT algorithm performed best. Input features were then selected step by step using analysis of variance (ANOVA), embedded, recursive, and exhaustive methods; the key features affecting FS were found to be TT, mE, APID, and Mo. Based on these key features and Bayesian optimization, an ML model with good performance was established. Finally, Shapley additive explanations (SHAP) and symbolic regression (SR) were introduced to improve the interpretability of the prediction model. SHAP analysis showed that TT and Mo had the most significant impact on FS; in particular, values of TT around 160 and of Mo around 0.15 were observed to be beneficial for increasing FS. SR was used to establish a significant mathematical relationship between these key features and FS.
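    The feature screening above starts from a per-feature ANOVA score. A minimal sketch of the underlying one-way F-statistic, with samples of one feature grouped by a (binned) target; this is the textbook formula, not this paper's code:

```python
def anova_f(groups):
    """One-way ANOVA F-statistic.

    groups: list of lists, each holding one feature's values for one
    target bin. F = (between-group variance) / (within-group variance);
    a large F means the feature separates the bins well.
    """
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

    Features with low F are dropped first; the embedded, recursive, and exhaustive passes then refine the surviving set.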

    Optimal Design of the Austenitic Stainless-Steel Composition Based on Machine Learning and Genetic Algorithm

    No full text
    As the fourth paradigm of materials research and development, the materials genome paradigm can significantly improve the efficiency of research and development for austenitic stainless steel. In this study, experimental data on austenitic stainless steel were collected, and the chemical composition was optimized by machine learning and a genetic algorithm so that production cost is reduced and the development of new steel grades is accelerated without degrading the mechanical properties. Specifically, four machine learning models were established to predict different mechanical properties, with the gradient boosting regression (GBR) algorithm demonstrating superior prediction accuracy compared to other commonly used machine learning algorithms. Bayesian optimization was then employed to tune the hyperparameters of the GBR algorithm, identifying the optimal hyperparameter combination. The resulting mechanical-property prediction models had good accuracy on the test set (yield strength: R2 = 0.88, MAE = 4.89 MPa; ultimate tensile strength: R2 = 0.99, MAE = 2.65 MPa; elongation: R2 = 0.84, MAE = 1.42%; reduction in area: R2 = 0.88, MAE = 1.39%). Moreover, feature importance and Shapley Additive Explanation (SHAP) values were used to analyze the interpretability of the prediction models and to assess how the features influence overall performance. Finally, the NSGA-III algorithm was used to simultaneously maximize the mechanical-property prediction models within the search space, obtaining the corresponding non-dominated set of chemical compositions and achieving optimization of austenitic stainless-steel compositions.
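    The search loop can be sketched with a much-simplified, single-objective stand-in for the NSGA-III optimization described above (the real method trades off several property models at once and returns a non-dominated set). The GA settings, the surrogate, and the [0, 1] composition encoding are all illustrative assumptions:

```python
import random

def ga_maximize(predict, dims, pop=30, gens=60, seed=0):
    """Genetic algorithm maximizing `predict` over a dims-dimensional
    composition vector with components in [0, 1]: elitist selection,
    one-point crossover, and Gaussian mutation."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(dims)] for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=predict, reverse=True)
        parents = P[: pop // 2]                  # elitist selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, dims)         # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(dims)              # Gaussian mutation, clipped
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        P = parents + children
    return max(P, key=predict)
```

    In the multi-objective setting, the scalar `predict` is replaced by the vector of property models and the sort by non-dominated ranking, which is exactly what NSGA-III provides.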

    Low-Jitter Discharge of a Plasma-Jet Triggered Gas Switch at Low Working Coefficients

    No full text